ABSTRACT

Conventional Web archives are created by periodically crawl- 
ing a Web site and archiving the responses from the Web 
server. Although easy to implement and commonly de- 
ployed, this form of archiving typically misses updates and 
may not be suitable for all preservation scenarios, for exam- 
ple a site that is required (perhaps for records compliance) 
to keep a copy of all pages it has served. In contrast, trans- 
actional archives work in conjunction with a Web server to 
record all content that has been served. Los Alamos Na- 
tional Laboratory has developed SiteStory, an open-source 
transactional archive written in Java that runs on Apache 
Web servers, provides a Memento-compatible access inter-
face, and supports WARC file export. We used Apache's
ApacheBench utility on a pre-release version of SiteStory 
to measure response time and content delivery time in dif- 
ferent environments and on different machines. The per- 
formance tests were designed to determine the feasibility 
of SiteStory as a production-level solution for high fidelity 
automatic Web archiving. We found that SiteStory does 
not significantly affect content server performance when it 
is performing transactional archiving. Content server per-
formance slows from 0.076 seconds to 0.086 seconds per Web
page access when the content server is under load, and from
0.15 seconds to 0.21 seconds when the resource has many
embedded and changing resources.

Categories and Subject Descriptors 

H.3.5 [Online Information Services]: Data Sharing 

General Terms 

Design, Experimentation 
1. INTRODUCTION

Web archiving is an important aspect of cultural, histori- 
cal, governmental, and even institutional memory. The cost 
of capturing Web-native content for storage and archiving 
varies and is dependent upon several factors. The cost of 
human-harvested Web archiving has prompted research into 
automated methods of digital resource capture. The tradi-
tional method of automatic capture is the Web
crawler, but recent migrations toward more personalized
and dynamic resources have rendered crawlers ineffective at
high-fidelity capture in certain situations. For example, a 
crawler cannot capture every representation of a resource 
that is customized for each user. Transactional archiving 
can, in some instances, provide an automatic archiving so- 
lution to this problem where crawlers fall short. 

1.1 Transactional Archiving 

The purpose of a transactional archive (TA) is to archive 
every representation of a resource that a Web server dis- 
seminates. A client does an HTTP GET on a URI and the 
Web server returns the representation of the resource at that 
time. At dissemination time, it is the responsibility of TA 
software to send the representation seen by the client to an 
archive. In this way, all representations returned by the Web 
server can be archived. If storing all served representations 
is costly (e.g., a high-traffic site with slowly changing re- 
sources), it is possible to optimize a TA in a variety of ways: 
store only unique representations, store every n th represen- 
tation, etc. 
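
The optimizations listed above can be illustrated with a minimal sketch. It is a toy in-memory model, not SiteStory's implementation: the class name, the policy names, and the choice of SHA-256 for change detection are all assumptions made for illustration.

```python
import hashlib


class TransactionalArchive:
    """Toy in-memory TA (hypothetical); policy is 'all', 'unique', or 'every_nth'."""

    def __init__(self, policy="all", n=1):
        self.policy = policy
        self.n = n
        self.stored = []         # (uri, body) copies that were kept
        self._last_digest = {}   # uri -> digest of the last stored body
        self._served = {}        # uri -> number of responses served so far

    def record(self, uri, body):
        """Call once per served response; returns True if a copy was stored."""
        count = self._served.get(uri, 0) + 1
        self._served[uri] = count
        if self.policy == "unique":
            # Skip a response whose body matches the last stored copy.
            digest = hashlib.sha256(body).hexdigest()
            if self._last_digest.get(uri) == digest:
                return False
            self._last_digest[uri] = digest
        elif self.policy == "every_nth":
            # Keep the 1st, (n+1)th, (2n+1)th, ... served response.
            if (count - 1) % self.n != 0:
                return False
        self.stored.append((uri, body))
        return True


def demo():
    """Serve six responses for one URI under each policy; count stored copies."""
    served = [b"v1", b"v1", b"v2", b"v2", b"v2", b"v3"]
    results = {}
    for policy, n in (("all", 1), ("unique", 1), ("every_nth", 3)):
        ta = TransactionalArchive(policy, n)
        for body in served:
            ta.record("/P", body)
        results[policy] = len(ta.stored)
    return results


print(demo())
```

The 'unique' policy trades a hash computation per response for reduced storage, which matters most for the high-traffic, slowly changing sites mentioned above.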

Figure 1 provides a visual representation of a typical page
change and user access scenario. This scenario assumes an
arbitrary page, called P, that changes at inconsistent
intervals. The timeline shows page P changing at points
C_1, C_2, C_3, C_4, and C_5 at times t_2, t_6, t_8, t_10, and t_13,
respectively. A user makes a request for P at points O_1, O_2,
and O_3 at times t_3, t_5, and t_11, respectively. A Web crawler
(that captures representations for storage in a Web archive)
visits P at points V_1 and V_2 at times t_4 and t_9, respectively.

Since O_1 occurs after change C_1, an archived copy of C_1
is made by the TA. When O_2 is made, P has not changed
since O_1 and therefore an archived copy is not made, since
one already exists. The Web crawler's visit V_1 captures C_1
and makes a copy in the Web archive. In servicing V_1, an
unoptimized TA will store another copy of C_1 at t_4, while an
optimized TA could detect that no change has occurred and
not store another copy of C_1.

Change C_2 occurs at time t_6, and C_3 occurs at time t_8.
There was no access to P between t_6 and t_8, which means
C_2 is lost - an archived copy exists in neither the TA nor the
Web crawler's archive. However, one can ask whether a change
that no entity observed should be archived at all.
Change C_3 occurs and the representation of P is archived
during the crawler's visit V_2, and the TA will also archive
C_3. After C_4, a user accesses P at O_3, creating an archived
copy of C_4 in the TA.

Figure 1: User and crawler accesses control the archival interval, capturing each returned representation.

In the scenario depicted in Figure 1, the TA will have
changes C_1, C_3, and C_4, while a conventional archive will only
have C_1 and C_3. Change C_2 was never served to any client
(human or crawler) and is thus not archived by either system.
Change C_5 will be captured by the TA when P is next
accessed.
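
The scenario of Figure 1 can be replayed as a short simulation. This is only a sketch of the reasoning in the text: the function name is hypothetical, and the times are the integer subscripts of the change, access, and visit times given above.

```python
def simulate(changes, user_accesses, crawler_visits):
    """Return (ta_versions, crawler_versions) as sorted lists of change indices.

    Change i (1-based) happens at time changes[i - 1]. A version is archived
    only if some client is served the page while that version is current.
    """
    def version_at(t):
        # Number of changes at or before time t (0 means no change yet).
        return sum(1 for c in changes if c <= t)

    # The TA sees every served response: user accesses and crawler visits.
    ta = {version_at(t) for t in user_accesses + crawler_visits}
    # The conventional archive only sees what the crawler fetched.
    crawled = {version_at(t) for t in crawler_visits}
    ta.discard(0)
    crawled.discard(0)
    return sorted(ta), sorted(crawled)


changes = [2, 6, 8, 10, 13]   # C_1 .. C_5
users = [3, 5, 11]            # O_1 .. O_3
crawler = [4, 9]              # V_1, V_2

ta_versions, crawler_versions = simulate(changes, users, crawler)
print("TA holds changes:", ta_versions)
print("Crawler archive holds changes:", crawler_versions)
```

With these inputs the TA ends up with changes 1, 3, and 4 and the crawler's archive with changes 1 and 3, which agrees with the walk-through above; change 2 is never served and change 5 has not yet been requested.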



1.2 SiteStory 

Los Alamos National Laboratory has developed
SiteStory (footnote 2), an open-source transactional Web archive.



Figure 2 illustrates the components and process of
SiteStory. First, mod_sitestory is installed on the Apache
server that contains the content to be archived. When the
Apache server builds the response for the requesting client,
mod_sitestory sends a copy of the response to the SiteStory
Web archive, which is deployed as a separate entity. This
Web archive then provides Memento-based access (see
Section 2) to the content served by the Apache server with
mod_sitestory installed, and the SiteStory Web archive is
discoverable from the Apache Web server using standard
Memento conventions (see Section 4 of [14]).

Sending a copy of the HTTP response to the archive is 
an additional task for the Apache Web server, and this task 
must not come at too great a performance penalty to the 
Web server. The goal of this study is to quantify the addi- 
tional load mod_sitestory places on the Apache Web server 
to be archived. 

1.3 Organization and Purpose 

This Technical Report details the work performed with
SiteStory and the results of the performance tests and
benchmarking performed as part of a feasibility study. The rest
of this Technical Report is organized as follows: Section 2
discusses the contributions of prior research efforts. Section
3 discusses the experiment design and execution. Section 4
details the results and findings of the experiment. Finally,
Section 5 summarizes the findings and impacts of this
Technical Report, and outlines the upcoming extensions of this
work.



2 http://mementoweb.github.com/SiteStory/



2. PRIOR WORK 

Extensive research has been done to determine how
documents change on the Web. Studies of "wild" pages (such
as Cho's work with crawlers [4] or Olston's work in recrawl
scheduling [10]) have shown that pages change extremely
frequently. Figure 3 (taken from Olston's paper) visually
shows the ephemeral nature of information contained within
a Web page. In this figure, one can see not only that pages
change very frequently, but also that they change in
different ways. Page A has small sections of
content that change rapidly. This behavior is called "churn".
Page B has longer-lived content, but additional content is
added to the page over time. This is called "scroll" behavior.

Prior research has focused on crawlers and robots to find
pages and monitor their change patterns [5] [6] [17]. These
crawlers follow the links on pages to discover other pages,
then archive and recrawl the discovered pages over time to
compile an archive. This method is unsuitable for an intranet
that is closed to the public Web; crawlers cannot access the
resources of archival interest [8]. Transactional archiving
offers finer control over the archival granularity.
Transactional archiving implementations
include TTApache [5] and pageVault [7]. TTApache is a
server-side solution and pageVault operates on the client
side. For each user access of a Web resource, TTApache
compares a hash of the content and stores a copy at the
server if it has changed, and pageVault determines if the
content has changed by rendering the content on the client
and archiving it locally if needed. These implementations
were also shown not to substantially increase the access time
seen by Web users; pageVault saw an increase of access time
from 1.1 ms to 1.5 ms, and TTApache saw a 5-25% increase
in response time, depending on requested document size.

Figure 2: SiteStory consists of two parts: mod_sitestory, which is installed on the Apache server
to be archived, and the transactional archive itself. Image taken from the SiteStory GitHub at
http://mementoweb.github.com/SiteStory/.

Figure 3: Page A shows rapidly appearing and
disappearing content, while Page B shows longer-
lived content. This image was originally published
in Olston's 2008 paper [10].

Memento is a joint project between Old Dominion
University and Los Alamos National Laboratory. The Memento
Framework defines HTTP extensions that allow content
negotiation in the dimension of time [15, 16]. When used with
Memento-aware user agents like MementoFox [11], users can
set a desired datetime in the past and browse the Web as
it existed at (or near) that datetime. Unlike other, single-
archive applications like DiffIE [12] [13], Past Web Browser
[9], or Zoetrope [1], Memento provides a multi-archive
approach to presenting the past Web. Integrating multiple
Web archives can give a more complete picture of the past
Web [2].
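
Content negotiation in the dimension of time, as described above, is expressed by the client through the Accept-Datetime request header. The following sketch only builds that header; it does not contact any archive, and the function name is an assumption made for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(when):
    """Return request headers asking for a resource as it existed at `when`.

    `when` must be a timezone-aware datetime; it is converted to UTC and
    rendered as an HTTP date, e.g. 'Tue, 28 Feb 2012 12:00:00 GMT'.
    """
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


headers = accept_datetime_headers(datetime(2012, 2, 28, 12, 0, tzinfo=timezone.utc))
print(headers)
```

A Memento-aware client would send these headers with an HTTP GET to a server that supports datetime negotiation, which then selects the archived representation closest to the requested datetime.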

3. EXPERIMENT DESIGN 

SiteStory was tested with a variety of loads, a variety of
resources, and on two machines with different configurations
and specifications. Three different tests were run during
the experiment. The details of the experiment setup and
execution are included in this section.

3.1 Experiment Machines 

The SiteStory benchmarking experiment was conducted
with a pre-release version of SiteStory installed on two
machines, PC1 and PC2. Both machines ran the prefork
version of the Apache 2 Web server, and in both cases the
mod_sitestory-enabled Apache server provided content from
localhost:8080 and the SiteStory archive was installed at
localhost:9999. Even though we installed SiteStory on
different ports on the same machine, it can be installed
on two different machines. Although the developers have
experimented with the optimizations discussed in Section 1.1,
SiteStory currently archives all returned representations
regardless of whether the representation has changed or not.

PC1 has a single-core 2.8 GHz processor. PC1 has no
free memory remaining; it is 100% utilized. PC1
represents a worst-case scenario for a server - it has been
completely bogged down with background processes. To
simulate this load, a script runs throughout the experiment
that initiates requests for Web pages to create the load on
the server. PC2 has two 1 GHz processors and is unhindered
during the testing by additional requests. Both of these
machines run Ubuntu Linux; PC1 ran version 11, while PC2
ran version 10. These machines complement each other by
providing two extremes of a potential content server: an
overtaxed, underperforming server and a high-performing,
unburdened server. The results from each of these machines
are provided in Section 4.



3.2 Experiment Runs 

Three separate experiments were run, and each experiment
was run on both machines, PC1 and PC2. The first
experiment tests the throughput of a content server enabled
with SiteStory software. This experiment ran for 45 days,
from January 14th, 2012 to February 28th, 2012. This is
described in Section 3.3. The second experiment performs
a series of accesses to 100 static resources to test the access
rates, response times, and round trip times possible. This
test was run from March 1st, 2012 until March 16th, 2012.
This is described in Section 3.4. The third experiment
performs a series of accesses to a set of 100 dynamic, constantly
changing resources to demonstrate a worst-case scenario
for SiteStory - everything is archived on each access. This
test was also run from March 1st, 2012 until March 16th,
2012. This final experiment is described in Section 3.5.



3.3 Connection Handling: ab 

The first experiment, which measured the difference in
throughput when SiteStory is running and when it is turned
off, was run twice a day (at 0700 and 1900 EST) for 45 days,
resulting in 90 datapoints. The experiment utilized the ab
(ApacheBench) tool. This utility makes N connections as
quickly as possible with concurrency C, where
N and C are specified by the user. The ab utility
records the response, throughput, and other server statistics
during a test. Essentially, the ApacheBench utility issues
HTTP GET requests for content as quickly as possible to
establish a benchmark for performance. A run of ab provides
output similar to that seen in Appendix A.

Three different HTML resources were targeted with this
test: a small, medium, and large file of sizes 1 kB, 250 kB,
and 700 kB. These resource sizes were chosen after a brief
survey of the file sizes in a corporate intranet
provided a minimum, average, and maximum file size of a Web
page (footnote 3). Four different connection counts (1,000, 10,000, 100,000,
216,000) and four concurrencies (1, 100, 200, 450) were used.
The connection counts of 1,000, 10,000, and 100,000 were
chosen arbitrarily. The connection count of 216,000 was
chosen after observing this as 2011's yearly maximum hourly
rate of Website access (footnote 4). Similarly, the access concurrencies
of 1, 100, and 200 were chosen arbitrarily, but the
concurrency of 450 was chosen as the observed average expected
number of concurrent accesses to a site (footnote 5). For simplicity and
brevity, this work will only consider a subset of these test
runs. This report discusses the runs of 10,000 connections
with concurrencies 1 and 100, and runs of 216,000
connections with concurrencies 1 and 100. This subset of results
is representative of the results of all other tests.
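
The full test matrix described above can be enumerated programmatically. The sketch below only builds the ab command lines (using ab's -n and -c options for the number of requests and the concurrency); it does not run them. The target URL and the function name are hypothetical placeholders.

```python
import itertools

CONNECTIONS = [1000, 10000, 100000, 216000]
CONCURRENCIES = [1, 100, 200, 450]


def ab_commands(url):
    """Return one ab argument list per (connections, concurrency) pair."""
    return [
        ["ab", "-n", str(n), "-c", str(c), url]
        for n, c in itertools.product(CONNECTIONS, CONCURRENCIES)
    ]


# Hypothetical target; the experiment served content from localhost:8080.
commands = ab_commands("http://localhost:8080/small.html")
for cmd in commands:
    print(" ".join(cmd))
```

On a machine with ab installed, each argument list could be passed to subprocess.run, once with archiving enabled and once with it disabled, to reproduce the paired runs.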

http://httpd.apache.org/docs/2.0/programs/ab.html

3 These file sizes were empirically determined in an internal
MITRE study.

4 This access rate was chosen after observing the average
2011 Website access rate within MITRE's corporate intranet.

5 This access concurrency was chosen after observing the average
2011 Website access rate within MITRE's corporate intranet.

The three resources were modified between each set of
connections to ensure the resource is archived each run. To
modify the resources, a script was run to update the timestamp
of the script's last run on each page, and to change the image
that was embedded in the page. These modifications
ensure that not only is the image changed and able to
be re-archived, but the surrounding HTML is changed as
well. Since SiteStory re-archives content whenever a change
is detected, each test run results in each resource being
re-archived. It is essential to make sure the resource is
re-archived in order to observe the effect of an archival action on the
content server performance.

Each ab test was performed twice: once while SiteStory
was turned on, and once while it was turned off. This shows
how SiteStory affects the content server performance. A
subset of the results is provided in Figure 4. The red lines
represent the runs in which SiteStory was turned off, while the
blue lines represent the runs in which SiteStory was turned
on. Each entry on the x-axis represents an independent test
run. The y-axis provides the amount of time it took to
execute the entire ab run. The horizontal lines represent the
averages over the entire experiment. The dotted, vertical
green lines indicate machine restart times due to power
outages. The power outages were noted to show when cache
and memory resets may have occurred that could impact the
performance of the machines.

Figures 4 and 5 illustrate how SiteStory affects the content
server's performance by portraying the
changes in the total run time of the ab test when SiteStory is
on (actively archiving served content) and off (not archiving
served content).

3.4 100 Static Resources: Clearing the Cache 

The second experiment uses the curl command to access
100 different HTML resources, none of which change. After
running the ab tests in Section 3.3, we hypothesized
that server caching was a reason for some of the anomalies.
This additional test shows the effect of clearing
the server cache on SiteStory by accessing a large number of
large files in sequence. This access pattern essentially thrashes the
server cache. Each resource has text and between 0 and
99 images (the 0th resource has 0 images, the 1st resource
has 1 image, etc.). These resources were generated by a
Perl script that constructed 100 different HTML pages and
embedded between 0 and 99 different images in the generated
resources. The resources were created with different sizes
and different numbers of embedded resources to demonstrate
how SiteStory affects content server performance with a
variety of page sizes and embedded images.
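
The structure of the generated resources can be sketched as follows. The original generator was a Perl script that is not reproduced in this report; the Python below is an illustrative stand-in, and the file names, image paths, and page text are assumptions.

```python
def build_page(n):
    """Return HTML for resource n, which embeds exactly n images."""
    images = "\n".join(
        f'<img src="images/img_{i}.png" alt="image {i}">' for i in range(n)
    )
    return (
        f"<html><head><title>Resource {n}</title></head>\n"
        f"<body>\n<p>Text of resource {n}.</p>\n{images}\n</body></html>\n"
    )


def build_all(count=100):
    """Return a mapping of hypothetical file names to page contents."""
    return {f"resource_{n}.html": build_page(n) for n in range(count)}


pages = build_all()
print(len(pages), "pages generated")
```

Because resource n embeds n images, page size and the number of embedded requests grow together, which is what lets the test relate access time to the number of embedded images.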

Figures 6(a) and 7(a) demonstrate the accesses of the 100
resources. The dark blue and red lines indicate the average
run time for accessing a resource (in seconds). The filled
areas around the lines are the standard deviation (σ) of the
observations over the duration of the experiment.



3.5 100 Changing Resources: Worst-Case Scenario



The same experiment from Section 3.4 was run with
each resource changing between runs to provide a "worst case
scenario" of data connections vs. archiving and run time.
A script was executed in between each run in which each
resource was updated to make SiteStory archive a new copy
of the resource. This means that each access resulted in a
new archived copy of each resource. The results of this run
are shown in Figures 8(a) and 9(a).

Note that Figures 8 and 9 are "burdened." An artificial
user load was induced on the servers to simulate a
production environment in which many users are requesting
content. A script was run during the test that made curl calls to
the server pages to induce the load. The impact of SiteStory
operating in a burdened environment is observed in Figures
8 and 9.

Figure 4: Total run time for 10,000 Connections.

(a) Total run time for the ab test with 216,000 connections and 1 concurrency.
(b) Total run time for the ab test with 216,000 connections and 100 concurrency.
Figure 5: Total run time for 216,000 Connections.

(a) Total access time for the 100 static resources on PC1. [Plot title: 100 Resources Test Result for PC1 (Averages)]
(b) Total access time for the 100 changing resources on PC1. [Plot title: 100 Resources Test Result for PC1 Change Algorithm (Averages)]
Figure 6: 100 resources accessed on PC1. Resource n has n embedded images.

[Plot title: 100 Resources Test Result for PC2 (Averages)]
(b) Total access time for the 100 changing resources on PC2. [Plot title: 100 Resources Test Result for PC2 Change Algorithm (Averages)]
Figure 7: 100 resources accessed on PC2. Resource n has n embedded images.

4. RESULTS 

After the completion of the experiment, the results of each 
test were analyzed. Immediately, patterns emerge between 
graphs and tests demonstrating the effect of SiteStory on 
content server performance. This section explores the results 
of the tests and makes note of the patterns. From these 
results, one can conclude whether or not SiteStory affects 
its host content server in an acceptable manner. 

4.1 ab Results 

For the ApacheBench tests described in Section 3.3,
several obvious patterns emerge. Primarily, there is little
separation between the total run times of the ab tests when
SiteStory is on and when SiteStory is off. One can observe
only minor differences in the plotted results. The results
differ very little between any given run of the tests, and
the averages across the experiment are almost identical in
all tests. In the run of 10,000 connections and 1
concurrency, the average total run time was 6.156 seconds when
SiteStory was off, and 6.214 seconds when SiteStory was on.
In the run of 10,000 connections and 100 concurrency, the
average total run time was 2.4 seconds when SiteStory was
off, and 2.42 seconds when SiteStory was on. In the run
of 216,000 connections and 1 concurrency, the average total run
time was 8.905 seconds when SiteStory was off, and 8.955
seconds when SiteStory was on. In the run of 216,000
connections and 100 concurrency, the average total run time was
4.698 seconds when SiteStory was off, and
4.706 seconds when SiteStory was on. This indicates that
SiteStory does not significantly affect the run time of the ab
tests, and therefore does not affect the performance of
the content server with regard to content delivery time.
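
The size of the effect can be made explicit with a little arithmetic on the averages reported above. The sketch below uses only those eight figures; the dictionary layout and function name are illustrative choices.

```python
# (connections, concurrency) -> (average seconds with SiteStory off, on),
# as reported in the text above.
AB_AVERAGES = {
    (10000, 1): (6.156, 6.214),
    (10000, 100): (2.4, 2.42),
    (216000, 1): (8.905, 8.955),
    (216000, 100): (4.698, 4.706),
}


def overhead(off, on):
    """Return (absolute increase in seconds, relative increase in percent)."""
    delta = on - off
    return delta, 100.0 * delta / off


for (connections, concurrency), (off, on) in AB_AVERAGES.items():
    delta, percent = overhead(off, on)
    print(f"n={connections:>6} c={concurrency:>3}: "
          f"+{delta:.3f} s ({percent:.2f}% slower with SiteStory on)")
```

In every configuration the increase in total run time stays below 1% of the run time with SiteStory off.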

Additionally, runs with a concurrency of 1 were more
consistent across the experiment, whereas runs with a
concurrency of 100 were less consistent, as indicated by
the spikes in run time. This could potentially be because
of server caching, connection limitations, or even machine
memory restrictions. The runs of 100 concurrency also
begin with a much longer total run time before dropping
significantly and leveling out at runs 9 and 10. This is due
to additional processes running on the experiment machines
that induced extra load in runs 1-8. However, the spikes
and inconsistencies are not confined to a single run, and
they appear both in the runs in which SiteStory is on and in those in which
SiteStory is off. As such, these anomalies are disregarded
since they affect both sets of runs.

Finally, the runs of 216,000 connections take much longer 
to complete than the runs of 10,000 connections - specif- 
ically, 2.736 seconds longer, on average. This is intuitive 
since more connections should take longer to run. Addi- 
tionally, the runs of concurrency 1 take 3.9 seconds longer 
than the runs of 100 concurrent connections. By executing 
more connections in parallel, the total run time is intuitively 
shorter. 

The ab test provides evidence that SiteStory does not
significantly affect server content delivery time. As such, a
production server can implement SiteStory without users
observing a noticeable difference in server performance.

4.2 100 Resource Results 

The runs of the 100 resources are more interesting, and
provide a deeper insight into how SiteStory affects the server's
performance than the ab test. This section examines the
results of both the static and changing resource tests, as they
provide interesting contrasts in performance. The results are
listed in Table 1.

When comparing the unburdened and burdened results
(such as Figure 6(a) vs. Figure 8(a)), it was observed that
the average run times are 0.07 and 0.086 seconds higher
when the server is under load and SiteStory is off and on,
respectively. Additionally, σ between the accesses is much
greater; 0.1292 and 0.1767 greater when SiteStory is on and
off, respectively, as indicated by the wider standard
deviation shown on the Figures.
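
These differences follow directly from the PC1 static-resource rows of Table 1. The short check below recomputes them; the variable and function names are illustrative, and the values are copied from the table.

```python
# PC1, static resources, from Table 1:
# (avg unburdened run time, unburdened sigma, avg burdened run time, burdened sigma)
PC1_STATIC = {
    "off": (0.121, 0.0254, 0.192, 0.2021),
    "on": (0.206, 0.1811, 0.292, 0.3103),
}


def burden_effect(row):
    """Return (increase in average run time, increase in sigma) under load."""
    avg_unburdened, sd_unburdened, avg_burdened, sd_burdened = row
    return (round(avg_burdened - avg_unburdened, 4),
            round(sd_burdened - sd_unburdened, 4))


for state, row in PC1_STATIC.items():
    d_avg, d_sd = burden_effect(row)
    print(f"SiteStory {state}: run time +{d_avg} s, sigma +{d_sd}")
```

The table gives an increase of 0.071 seconds with SiteStory off (0.07 in the text) and 0.086 seconds with SiteStory on, with σ growing by 0.1767 and 0.1292, respectively.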

When comparing the unchanging vs. changing resources
(such as Figure 6(a) vs. 6(b)), it is apparent that σ is, on
average, two times higher for the changing resources than
the unchanging resources. (The average σ is 0.0839 for unchanging
resources and 0.1680 for changing resources.)
Additionally, the average access times when SiteStory is off
remain approximately the same whether the resources change
or remain the same. The interesting result is that the
average access time increases from 0.15 seconds per GET to 0.21
seconds per GET for the changing resources when SiteStory
is on. This is intuitive considering SiteStory needs to
re-archive the accessed content during an access when the
resource changes.

When comparing the two machines, PC1 and PC2 (such
as Figure 6(a) vs. 9(b)), one sees that PC2 gives a nearly
negligible access time, while PC1 gives a measurable access
time. This is because PC2 is a dual-core machine and can
handle the additional load more quickly, while PC1 must
context switch between processes, causing an increased
delay.



The most important observation in any of the Figures 6(a) -
9(b) is that the run time of this test is approximately 0.5
seconds higher on average when SiteStory is on vs. when
SiteStory is off. This number is reached by comparing the
difference in average run time for each test when SiteStory
is on vs. off. For each on-off pair, the average difference was
taken to reach the approximate 0.5 second difference across
all tests. That is, the difference between the average run
times of the tests in Figure 6(a) when SiteStory is running
(red) vs. when SiteStory is off (blue) is 0.08 seconds. When
the same comparison is performed across all tests and the
average of these results is taken, an overall impact of SiteStory
on server performance is realized.

Each Figure begins with SiteStory off taking more time 
than when SiteStory is on, but this can be attributed to 
experiment anomaly or similar server access anomaly. In- 
evitably, the run time when SiteStory is on becomes slower 
than when SiteStory is off as the resource size increases. This
demonstrates that the performance difference of a server
when SiteStory is on vs. off is worse when there is a large
number of embedded resources, such as images. PC1's
average page access time increases by, on average, 0.006 seconds
per embedded image. One could come to the conclusion that
servers providing access to image-laden resources would see
the biggest performance decrease when utilizing SiteStory.



(a) Total access time for the 100 static resources on a burdened PC1. [Plot title: 100 Resources Test Result for PC1 Change Algorithm (Averages)]
(b) Total access time for the 100 changing resources on a burdened PC1. [Plot title: 100 Resources Test Result for PC1 (Averages)]
Figure 8: 100 resources accessed on a burdened PC1. Resource n has n embedded images.



Table 1: 100 Resource Test Results

Case            Avg. Unburdened   Unburdened σ   Avg. Burdened   Burdened σ
                Run Time (s)                     Run Time (s)

Static Resources
PC1, SS Off     0.121             0.0254         0.192           0.2021
PC1, SS On      0.206             0.1811         0.292           0.3103
PC2, SS Off     0.056             0.0011         0.056           0.0001
PC2, SS On      0.056             0.0009         0.056           0.0001

Changing Resources
PC1, SS Off     0.132             0.0346         0.225           0.2174
PC1, SS On      0.354             0.4244         0.292           0.6137
PC2, SS Off     0.057             0.0021         0.056           0.0002
PC2, SS On      0.057             0.0016         0.056           0.0002


(a) Total access time for the 100 static resources on a burdened PC2. [Plot title: 100 Resources Test Result for PC2 Change Algorithm (Averages)]
(b) Total access time for the 100 changing resources on a burdened PC2. [Plot title: 100 Resources Test Result for PC2 (Averages)]
Figure 9: 100 resources accessed on a burdened PC2. Resource n has n embedded images.



5. CONCLUSIONS 

In this work, SiteStory was stress tested and benchmarked. 
The results of this study have shown that SiteStory does not 
significantly affect the performance of a server. While dif- 
ferent servers and different use cases cause different perfor- 
mance effects when SiteStory is archiving content, the host 
server is still able to serve sites in a timely manner. The type
of resource and the resource change rate also affect the server's
performance - resources with many embedded images and
frequently changing content are affected most by SiteStory,
seeing the biggest reduction in performance.

These results are observed in Figures 4 - 5 as well as
Table 1. SiteStory does not significantly increase the load on
a server or affect its ability to serve content - the response
times seen by users will not be noticeably different in most
cases. However, these graphs demonstrate the impact of
SiteStory on performance, albeit small - larger resources
with many embedded resources take longer to serve when
SiteStory is on as opposed to when SiteStory is off due to
the increased processing required of the server. Nevertheless,
the significant finding of this work is that SiteStory will not
cripple, or even significantly reduce, a server's ability to
provide content to users. Specifically, SiteStory only increases
response times by a fraction of a second - from 0.076
seconds to 0.086 seconds per access when the server is under
load, and from 0.15 seconds to 0.21 seconds when the
resource has many embedded and changing resources. These
increases will not be noticed by human users.
APPENDIX

A. SAMPLE RUN OF APACHEBENCH 

A sample run of ApacheBench (ab) is provided below:

ab -n 1000 -c 1 http://localhost/time.php
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests

Server Software:        Apache/2.2.16
Server Hostname:        localhost
Server Port:            80

Document Path:          /HitPage.html
Document Length:        298 bytes

Concurrency Level:      1
Time taken for tests:   0.260 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Non-2xx responses:      1000
Total transferred:      501000 bytes
HTML transferred:       298000 bytes
Requests per second:    3842.34 [#/sec] (mean)
Time per request:       0.260 [ms] (mean)
Time per request:       0.260 [ms] (mean, across all concurrent requests)
Transfer rate:          1879.90 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0    0   0.0      0       1
Waiting:        0    0   0.0      0       0
Total:          0    0   0.0      0       1

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      0
  99%      0
 100%      1 (longest request)
